Motivation

It’s difficult to overstate the extent to which the COVID-19 pandemic has tested the world’s public health infrastructure over the past two years. At the same time, the COVID-19 “experience” has manifested unequally – not just country to country, but city to city, and even borough to borough. As one of the most heterogeneous urban areas in the world, New York City provides a fascinating case study into the ways socioeconomic status may be associated with, or even mediate, disparities in health outcomes. (For instance, it’s already well-documented that income and race, along with socioeconomic privilege and political ideology, drive inequities in COVID vaccination rate across US cities.)

Knowing that socioeconomic factors have historically been associated with health outcomes, we aim to examine relationships between a range of predictors (e.g. race/ethnicity, education, broadband internet access, household income / occupational income score, public vs. private health insurance) and COVID-19 health outcomes – namely, hospitalizations, deaths, and vaccinations.

Initial Questions

Our initial questions are centered on how demographic factors in New York City trend with COVID-19 outcomes. Particularly, what are the key demographic correlates (e.g. race/ethnicity, education) of COVID-19 hospitalizations or vaccinations? Do PUMAs with more individuals on public health insurance fair equally well on key outcomes compared to those with more individuals on private health insurance? How does household income fit into the equation? Broadly, we aim to explore the myriad relationships between predictors and COVID-19 outcomes across New York City census tracts and boroughs.

Data Sources and Cleaning

Data Sources

Integrated Public Use Microdata Series (IPUMS USA)

The Integrated Public Use Microdata Series (IPUMS USA) consists of individual-level data from samples of the US population drawn from the American Community Surveys (ACS) of 2000 - present as well as from fifteen federal censuses from 1850 to 2010. Each record in IPUMS is a person with numerically coded characteristics and “weight” variables indicating how many persons in the population are represented by each record. Samples were created at different times by different investigators, which lead to a variety of documentation conventions. However, IPUMS applies consistent coding and documentation across records to allow for effective analysis over time. A data extraction system exists to allow users to pull particular samples and variables from IPUMS. This project uses demographic and macroeconomic data from the American Community Survey (ACS) 2019 five-year estimate via IPUMS.

While data from IPUMS is recorded at the individual-level, each interview is coded to a particular Public Use Microdata Area (PUMA) geography where the census interviewee’s housing unit was located at the time of interview.

New York City Department of Health and Mental Hygiene (NYC DOHMH)

The New York City Department of Health and Mental Hygiene (NYC DOHMH) is one of the oldest public health agencies in the United States. Among other responsibilities, the DOHMH monitors the spread of infectious disease in NYC. The Department of Health classified the beginning of the COVID-19 pandemic as February 29, 2020, the date of the first laboratory-confirmed case. Since then, the DOHMH has recorded and reported COVID-19 data on a daily, weekly, or monthly basis. These data include cases, hospitalizations, and deaths by borough, modified Zip Code tabulation area (ZCTA), and demographic factors. As NYC has administered vaccinations for COVID-19, these data have been recorded and made available by borough, ZCTA, and demography. This project uses COVID-19 hospitalization rates, death rates, and vaccination rates by ZCTA in NYC.

Baruch College, City University of New York - Geoportal

The Baruch Geoportal, maintained by the Newman Library at Baruch College, is a repository of geospatial resources that includes tabular data sets, tutorials, maps, and crosswalks. This project uses a crosswalk data set from Baruch Geoportal to apportion NYC ZCTAs to PUMAs, enabling analysis of PUMA-coded data from IPUMS alongside COVID-19 outcome data from DOHMH at similar levels of granularity.

Data Cleaning

Pulling Data

As mentioned above, monthly outcome data reported at the ZCTA-level geography were obtained from the NYC DOHMH. First, these data were summed over the time interval March 2020 - Sept 2021 to obtain one cumulative incidence measure per ZCTA. However, the predictor variables from IPUMS were coded to the PUMA-level geography, and so we used the Baruch ZCTA-PUMA crosswalk data set to convert our data into common geographical units.

The following columns were used to convert ZCTA-level outcome data to PUMA-level data:

  • zcta10: ZCTA unique identifier

  • puma10: PUMA unique identifier

  • per_in_puma: percentage of the specified ZCTA located within the specified PUMA

  • per_of_puma: percentage of the specified PUMA occupied by the specified ZCTA

The following shows the mathematical expression used to convert from ZCTA-level outcome data to PUMA-level outcome data:

\[ \sum_{i = 1}^{n} \text{zcta_outcome_data}_{i} \cdot \text{per_in_puma}_{i} \cdot \text{per_of_puma}_{i} = \text{puma_outcome_data} \]

This bridged us towards one cumulative incidence measure per PUMA (per 100,000 people for hospitalizations and deaths, and per 100 people for vaccinations).

We proceeded to clean our demographic dataset from IPUMS, which originally contained over 350,000 interviews from the NYC area:

# Cleaning
jimzip <- function(csv_file, path) {
  # create full path to csv file
  full_csv <- paste0(path, "/", csv_file)
  # append ".zip" to csv file
  zip_file <- paste0(full_csv, ".zip")
  # unzip file
  unzip(zip_file)
  # read csv
  data_extract <- read_csv(csv_file)
  # be sure to remove file once unzipped (it will live in working directory)
  on.exit(file.remove(csv_file))
  # output data
  data_extract
}

# Apply function to filtered census data CSV
census_data <- jimzip("census_filtered.csv", "./data")

# Merging the Outcome Data

# Read in PUMA outcomes data
health_data <-
  read_csv("./data/outcome_puma.csv")

# Merge census data with PUMA outcomes data
merged_data <- merge(census_data, health_data, by = "puma")

# Deprecate census data alone
rm(census_data)

## Cleaning the Data

# Clean the merged census and outcomes data
# Each row represents one 
cleaned_data = 
  merged_data %>% 
  # Remove variables less useful for analysis or redundant (high probability of collinearity with remaining variables)
  select(-serial, -cluster, -strata, -multyear, -ancestr1, -ancestr2, -labforce, -occ, -ind, -incwage, -occscore, -pwpuma00, -ftotinc, -hcovpub) %>% 
  # Remove duplicate rows, if any
  distinct() %>% 
  # Rename variables
  rename(
    borough = countyfip,
    has_broadband = cihispeed,
    birthplace = bpl,
    education = educd,
    employment = empstat,
    personal_income = inctot,
    work_transport = tranwork,
    household_income = hhincome,
    on_foodstamps = foodstmp,
    family_size = famsize,
    num_children = nchild,
    US_citizen = citizen,
    puma_vacc_rate = puma_vacc_per,
    on_welfare = incwelfr,
    poverty_threshold = poverty
  ) %>% 
  # Recode variables according to data dictionary
  mutate(
    # Researched mapping for county
    borough = recode(
      borough,
      "5" = "Bronx",
      "47" = "Brooklyn",
      "61" = "Manhattan",
      "81" = "Queens",
      "85" = "Staten Island"
    ),
    rent = ifelse(
      rent == 9999, 0,
      rent
    ),
    household_income = ifelse(
      household_income %in% c(9999998,9999999), NA,
      household_income
    ),
    on_foodstamps = recode(
      on_foodstamps,
      "1" = "No",
      "2" = "Yes"
    ),
    has_broadband = case_when(
      has_broadband == "20" ~ "No",
      has_broadband != "20" ~ "Yes"
    ),
    sex = recode(
      sex,
      "1" = "Male",
      "2" = "Female"
    ),
    # Collapse Hispanic observation into race observation
    race = case_when(
      race == "1" ~ "White",
      race == "2" ~ "Black",
      race == "3" ~ "American Indian",
      race %in% c(4,5,6) ~ "Asian and Pacific Islander",
      race == 7 & hispan %in% c(1,2,3,4) ~ "Hispanic",
      race == 7 & hispan %in% c(0,9) ~ "Other",
      race %in% c(8,9) ~ "2+ races"
    ),
    birthplace = case_when(
      birthplace %in% 1:120 ~"US",
      birthplace %in% 121:950 ~ "Non-US",
      birthplace == 999 ~"Unknown"
    ),
    US_citizen = case_when(
      US_citizen %in% c(1,2) ~ "Yes",
      US_citizen %in% 3:8 ~"No",
      US_citizen %in% c(0,9) ~ "Unknown"
    ),
    # Chose languages based on highest frequency observed
    language = case_when(
      language == "1" ~ "English",
      language == "12" ~ "Spanish",
      language == "43" ~ "Chinese",
      language == "0" ~ "Unknown",
      language == "31" ~ "Hindi",
      !language %in% c(1,12,43,0,31) ~ "Other"
    ),
    # Collapse multiple health insurance variables into single variable
    health_insurance = case_when(
      hcovany == 1 ~ "None",
      hcovany == 2 && hcovpriv == 2 ~ "Private",
      hcovany == 2 && hcovpriv == 1 ~ "Public"
    ),
    education = case_when(
      education %in% 2:61 ~ "Less Than HS Graduate",
      education %in% 62:64 ~ "HS Graduate",
      education %in% 65:100 ~ "Some College",
      education %in% 110:113 ~ "Some College",
      education == 101 ~ "Bachelor's Degree",
      education %in% 114:116 ~ "Post-Graduate Degree",
      education %in% c(0,1,999) ~ "Unknown"
    ),
    employment = case_when(
      employment %in% c(0,3) ~ "Not in labor force",
      employment == 1 ~ "Employed",
      employment == 2 ~ "Unemployed"
    ),
    personal_income = ifelse(
      personal_income %in% c(9999998,9999999), NA,
      personal_income
    ),
    household_income = ifelse(
      household_income %in% c(9999998,9999999), NA,
      household_income
    ),
    on_welfare = case_when(
      on_welfare > 0 ~ "Yes",
      on_welfare == 0 ~ "No"
    ), 
    poverty_threshold = case_when(
      poverty_threshold >= 100 ~ "Above",
      poverty_threshold < 100 ~ "Below"
    ),
    work_transport = case_when(
      work_transport %in% c(31:37, 39) ~ "Public Transit",
      work_transport %in% c(10:20, 38) ~ "Private Vehicle",
      work_transport == 50 ~ "Bicycle",
      work_transport == 60 ~ "Walking",
      work_transport == 80 ~ "Worked From Home",
      work_transport %in% c(0, 70) ~ "Other"
    )
  ) %>% 
  mutate(education = fct_relevel(education, "Less Than HS Graduate", "HS Graduate", "Some College", "Bachelor's Degree", "Post-Graduate Degree", "Unknown")) %>% 
  # Eliminate columns no longer needed after transformation
  select(-hispan, -hcovany, -hcovpriv) %>% 
  # Relocate new columns
  relocate(health_insurance, .before = personal_income) %>% 
  relocate(poverty_threshold, .before = work_transport) %>% 
  relocate(on_welfare, .before = poverty_threshold) %>% 
  relocate(perwt, .before = hhwt) %>% 
  # Create factor variables where applicable
  mutate(across(.cols = c(puma, borough, on_foodstamps, has_broadband, sex, race, birthplace, US_citizen, language, health_insurance, education, employment, on_welfare, poverty_threshold, work_transport), as.factor)) %>% 
  # Ensure consistent use of percentages
  mutate(
    puma_death_rate = puma_death_rate / 1000,
    puma_hosp_rate = puma_hosp_rate / 1000
  )

# View first few rows
cleaned_data %>% 
  head() %>% 
  knitr::kable()
puma perwt hhwt borough rent household_income on_foodstamps has_broadband family_size num_children sex age race birthplace US_citizen language education employment health_insurance personal_income on_welfare poverty_threshold work_transport puma_death_rate puma_hosp_rate puma_vacc_rate
3701 18 25 Bronx 0 166870 No No 2 0 Male 58 White Non-US Yes English Post-Graduate Degree Employed Private 83435 No Above Public Transit 0.3982366 1.064625 55.79213
3701 22 22 Bronx 1018 123192 No Yes 2 0 Male 52 Black Non-US Yes Other Some College Employed Private 107920 No Above Private Vehicle 0.3982366 1.064625 55.79213
3701 16 17 Bronx 2000 125000 No Yes 1 0 Female 59 White US Unknown Spanish Some College Employed Private 125000 No Above Public Transit 0.3982366 1.064625 55.79213
3701 4 6 Bronx 0 140588 Yes Yes 6 3 Male 64 Asian and Pacific Islander Non-US Yes Chinese Post-Graduate Degree Not in labor force Private 11264 No Above Other 0.3982366 1.064625 55.79213
3701 20 19 Bronx 1058 52876 No Yes 2 1 Male 44 White Non-US Yes Spanish Some College Employed Private 32373 No Above Public Transit 0.3982366 1.064625 55.79213
3701 9 9 Bronx 0 26101 No Yes 1 0 Female 64 White US Unknown English Post-Graduate Degree Employed Private 26101 No Above Private Vehicle 0.3982366 1.064625 55.79213

Because our outcomes data was minimially aggregated at the PUMA level and could not be disaggregated further, we decided it would be prudent to create a second cleaned data set for which each observation corresponded to a given PUMA, rather than a given census observation. To do so, we needed to use perwt (person weight, describing the number of persons in the coded area for whom the interview was representative) and hhwt (household weight, which plays a similar role but by household rather than by person in each geographic area) to weight our key census variables at the aggregated PUMA level.

# Example data frame with weightings for summary stats over each PUMA
nyc_puma_summary = cleaned_data %>% 
  # Note: do we need to filter to one individual per household for household weightings?
  group_by(puma) %>%
  summarize(
    total_people = sum(perwt),
    median_household_income = weighted.median(household_income, hhwt, na.rm = TRUE),
    perc_foodstamps = sum(hhwt[on_foodstamps == "Yes"]) * 100 / sum(hhwt),
    perc_broadband = sum(hhwt[has_broadband == "Yes"]) * 100 / sum(hhwt),
    perc_male = sum(perwt[sex == "Male"]) * 100 / sum(perwt),
    median_age = weighted.median(age, perwt, na.rm = TRUE),
    perc_white = sum(perwt[race == "White"]) * 100 / sum(perwt),
    perc_foreign_born = sum(perwt[birthplace == "Non-US"]) * 100 / sum(perwt),
    perc_citizen = sum(perwt[US_citizen == "Yes"]) * 100 / sum(perwt),
    perc_english = sum(perwt[language == "English"]) * 100 / sum(perwt),
    perc_college = sum(perwt[education %in% c("Some College", "Bachelor's Degree", "Post-Graduate Degree")]) * 100 / sum(perwt),
    perc_unemployed = sum(perwt[employment == "Unemployed"]) * 100 / sum(perwt),
    perc_insured = sum(perwt[health_insurance %in% c("Private", "Public")]) * 100 / sum(perwt),
    median_personal_income = weighted.median(personal_income, perwt, na.rm = TRUE),
    perc_welfare = sum(perwt[on_welfare == "Yes"]) * 100 / sum(perwt),
    perc_poverty = sum(perwt[poverty_threshold == "Below"]) * 100 / sum(perwt),
    perc_public_transit = sum(perwt[work_transport == "Public Transit"]) * 100 / sum(perwt),
    covid_hosp_rate = median(puma_hosp_rate),
    covid_vax_rate = median(puma_vacc_rate),
    covid_death_rate = median(puma_death_rate)
  )

# View first few rows of PUMA-summarized data frame
nyc_puma_summary %>% 
  head() %>% 
  knitr::kable()
puma total_people median_household_income perc_foodstamps perc_broadband perc_male median_age perc_white perc_foreign_born perc_citizen perc_english perc_college perc_unemployed perc_insured median_personal_income perc_welfare perc_poverty perc_public_transit covid_hosp_rate covid_vax_rate covid_death_rate
3701 110016 69000 24.49306 89.50725 46.88500 39 46.53050 34.50407 20.26887 41.74120 47.96393 3.449498 93.91725 22357.09 20.21706 22.66943 24.79821 1.0646245 55.79213 0.3982366
3702 151067 68610 28.24215 82.93904 45.12435 36 14.06925 42.97100 27.86909 66.47183 39.83729 4.534412 92.58276 20859.00 22.08623 18.60036 21.55732 1.1782180 47.19203 0.2688261
3703 119529 77588 15.90074 86.60915 46.66734 43 41.20088 22.69324 16.91138 58.42432 45.13131 3.577374 94.11273 25030.00 16.73067 16.28140 19.85292 1.3040106 58.33806 0.4052719
3704 126664 64662 23.88688 81.56041 47.66311 37 32.72911 37.88685 23.03812 41.10165 40.30664 4.141666 92.33879 20362.00 20.17700 18.97698 23.55681 0.6861973 29.38498 0.2092428
3705 171016 36500 50.66455 87.59563 46.19509 30 15.45177 33.49043 16.46103 33.31033 28.35290 5.289563 91.41484 11264.00 29.20428 39.26943 23.03995 0.8255382 33.17926 0.2018467
3706 131986 43490 46.18132 82.15610 48.26724 32 19.37099 46.80496 20.58855 20.86812 29.15536 5.606655 89.18219 15000.00 25.09812 32.16402 28.97883 0.7723878 32.95661 0.1812561

This cleaned and aggregated data set has 55 rows, one for each PUMA, and is the basis for much of the exploratory analysis that follows.

Exploratory Analysis

Overview of Outcome Variables

Ultimately, we ended the data cleaning process with two data sets at different levels of aggregation: one at the census interview level, and another at the PUMA level. Most of our exploratory analysis focused on evaluating predictors and outcomes by PUMA, although the interview-level data was still helpful for us to determine outcome disparities across predictors city-wide.

We began by seeking a better understanding of how each of our three key outcomes – hospitalization rate, death rate, and vaccination rate – differ by PUMA.

When PUMAs are ranked from lowest outcome rate to highest outcome rate, and colored by borough, we find the following, as visualized in the graphs below:

  • Hospitalization rate ranges from <0.5% to nearly 2%, with the plurality of high hospitalization rates occurring in Queens PUMAs and the plurality of low hospitalization rates occurring in Manhattan and Brooklyn PUMAs
  • Death rate ranges from ~0.05% to ~0.6%, with the plurality of high death rates occurring in Queens PUMAs and the plurality of low death rates occurring in Manhattan and Brooklyn PUMAs
  • Vaccination rate ranges from nearly 30% to over 100% (due to migrations between PUMAs), with all of the highest vaccination rates occurring in Manhattan and Queens, and all of the lowest vaccination rates occurring in Brooklyn and the Bronx

Although our regression modeling explores the relationship between particular variable pairings more fully to check for collinearity, we also were interested in observing how associated our key outcome variables were.

We found that hospitalization and death rate were significantly correlated, whereas vaccination had limited correlation with both hospitalization (0.187) and death (0.075) at the PUMA level. Interestingly, the correlation between our key outcome variables was effectively modified by borough; for example, vaccination and hospitalization were significantly correlated in the Bronx (0.930), whereas hospitalization and death were not significantly correlated in Staten Island (0.526).

Outcomes by Borough

After exploring outcomes at the PUMA level, we were keen to dive more deeply into disparities by borough. Beyond simply confirming with boxplots the distribution of key outcomes across PUMAs in a given borough (e.g. hospitalization and death distributed towards higher rates in Queens and Bronx PUMAs, vaccination distributed towards higher rates in Queens and Manhattan), we were also curious to see what would happen if we took the median city-wide PUMA for each outcome, and binarily divided the PUMAs in each borough into those above or below that median. Although we’re working with a small sample of only 55 PUMAs, which makes our borough percentages highly sensitive to single-PUMA changes, we find – to take one example – that ~79% of PUMAs in Queens are above the city-wide median PUMA on death rate, but ~86% of PUMAs in Queens are above the city-wide median PUMA on vaccination rate.

Outcomes by Demographic Combinations

For each unique combination of age group, race, and sex, we wanted to determine which demographic category performed best and worst on each key outcome, and populated the following tables:

# Lowest hospitalization rates
race_age_sex %>% 
  filter(outcome == "hosp_rate") %>% 
  mutate(
    outcome_rate = outcome_rate
  ) %>% 
  arrange(outcome_rate) %>% 
  select(race, age_class, sex, outcome_rate) %>% 
  head() %>% 
  knitr::kable(
    caption = "Lowest hospitalization rates",
    col.names = c("Race", "Age", "Sex", "% Hospitalized")
  )
Lowest hospitalization rates
Race Age Sex % Hospitalized
White 21-30 Female 0.8761509
White 31-40 Male 0.8859089
White 31-40 Female 0.8908692
White 21-30 Male 0.8963492
White 41-50 Male 0.9275371
White 41-50 Female 0.9347574
# Highest hospitalization rates
race_age_sex %>% 
  filter(outcome == "hosp_rate") %>% 
  mutate(
    outcome_rate = outcome_rate
  ) %>% 
  arrange(desc(outcome_rate)) %>% 
  select(race, age_class, sex, outcome_rate) %>% 
  head() %>% 
  knitr::kable(
    caption = "Highest hospitalization rates",
    col.names = c("Race", "Age", "Sex", "% Hospitalized")
  )
Highest hospitalization rates
Race Age Sex % Hospitalized
American Indian 81-90 Male 1.272494
Other 61-70 Male 1.216953
Other 11-20 Male 1.211757
Other 41-50 Male 1.200270
Other 61-70 Female 1.185197
Asian and Pacific Islander 91-100 Male 1.173874
# Lowest death rates
race_age_sex %>% 
  filter(outcome == "death_rate") %>% 
  mutate(
    outcome_rate = outcome_rate
  ) %>% 
  arrange(outcome_rate) %>% 
  select(race, age_class, sex, outcome_rate) %>% 
  head() %>% 
  knitr::kable(
    caption = "Lowest death rates",
    col.names = c("Race", "Age", "Sex", "% Deceased")
  )
Lowest death rates
Race Age Sex % Deceased
White 21-30 Female 0.2376629
White 31-40 Male 0.2424339
White 21-30 Male 0.2435737
White 31-40 Female 0.2455473
Other 81-90 Male 0.2541018
White 41-50 Male 0.2564677
# Highest death rates
race_age_sex %>% 
  filter(outcome == "death_rate") %>% 
  mutate(
    outcome_rate = outcome_rate
  ) %>% 
  arrange(desc(outcome_rate)) %>% 
  select(race, age_class, sex, outcome_rate) %>% 
  head() %>% 
  knitr::kable(
    caption = "Highest death rates",
    col.names = c("Race", "Age", "Sex", "% Deceased")
  )
Highest death rates
Race Age Sex % Deceased
2+ races 91-100 Male 0.3525028
Asian and Pacific Islander 91-100 Male 0.3386789
American Indian 81-90 Male 0.3246158
Other 11-20 Male 0.3228600
Other 41-50 Male 0.3220782
Other 71-80 Male 0.3202419
# Lowest vax rates
race_age_sex %>% 
  filter(outcome == "vax_rate") %>% 
  mutate(
    outcome_rate = outcome_rate
  ) %>% 
  arrange(outcome_rate) %>% 
  select(race, age_class, sex, outcome_rate) %>% 
  head() %>% 
  knitr::kable(
    caption = "Lowest vaccination rates",
    col.names = c("Race", "Age", "Sex", "% Vaccinated")
  )
Lowest vaccination rates
Race Age Sex % Vaccinated
American Indian 71-80 Male 49.32433
Black 11-20 Female 49.35445
Black <11 Male 49.47578
Black 11-20 Male 49.48017
Black 31-40 Female 49.49496
Black <11 Female 49.51926
# Highest vax rates
race_age_sex %>% 
  filter(outcome == "vax_rate") %>% 
  mutate(
    outcome_rate = outcome_rate
  ) %>% 
  arrange(desc(outcome_rate)) %>% 
  select(race, age_class, sex, outcome_rate) %>% 
  head() %>% 
  knitr::kable(
    caption = "Highest vaccination rates",
    col.names = c("Race", "Age", "Sex", "% Vaccinated")
  )
Highest vaccination rates
Race Age Sex % Vaccinated
Asian and Pacific Islander 91-100 Female 66.13596
Asian and Pacific Islander 71-80 Female 65.89982
Asian and Pacific Islander 71-80 Male 65.73426
Asian and Pacific Islander 31-40 Male 65.61553
Asian and Pacific Islander 31-40 Female 65.57662
2+ races 81-90 Male 65.47910

Overall, hospitalization and death rates were lowest (best) among white males and females under age 30, whereas vaccination rates were highest (best) among Asian and Pacific Islander males and females. Older American Indian individuals, along with younger and middle-aged Black individuals, tended to have the lowest vaccination rates, while mixed race, American Indian, and “other” racial groups tended to have higher hospitalization and death rates.

Associations Between Individual Predictors and Outcomes

After exploring disparities in key outcomes across PUMAs and boroughs, we turn towards our large set of predictors, and begin by trying to determine which predictors to focus on using a correlation matrix (with outcomes). In the plot below, we include text only on those tiles that are highly significant, with p < 0.01.

Interestingly, we find:

  • The variables highly positively correlated with hospitalization rate at the PUMA level are percent US citizens and percent foreign born; the variables highly negatively correlated with hospitalization rate at the PUMA level are percent white, percent using public transit to get to work, percent with health insurance, percent English-speaking at home, percent with college education, percent with broadband access, and median personal and household incomes.
  • The variables highly positively correlated with death rate at the PUMA level, are percent US citizens and percent foreign born; the variables highly negatively correlated with death rate at the PUMA level are percent using public transit to get to work, percent with health insurance, percent college educated, and median personal and household incomes.
  • The variables highly positively correlated with vaccination rate at the PUMA level are percent white, percent male, percent college educated, percent with broadband access, median age, and median personal and household incomes; the variables highly negatively correlated with vaccination rate at the PUMA level are percent on welfare, percent unemployed, percent below the poverty threshold, and percent on food stamps

All in all, there are a couple of interesting trends observed in this correlation matrix. First, income seems strongly associated with all three outcome variables. In addition, the variables strongly associated with hospitalization rate also tend to be strongly associated with death rate, which is not much of a surprise given the high correlation already observed between hospitalization and death rate across PUMAs. Finally, we note that many signifiers of socioeconomic status – including poverty level, welfare, unemployment, and food stamp utilization – seem highly correlated with vaccination, but not with hospitalization and death. Given that vaccination is a more “active” outcome (requiring deliberate action) here than hospitalization or death, which are both the result of (in all likelihood) “passive” or indeliberate transmission, the association of socioeconomic factors with vaccination begin to indicate the potential impact from significant structural inequalities at play, hindering healthcare access among the poor.

We then selected each of the four variables with highest correlation (positive or negative) to each outcome, excluding obvious redundancies (like personal income and household income), and explored specific association trends between predictor and outcome, colored by borough.

Beyond the predictors significantly associated with each outcome, we wanted to focus as well on how outcomes varied by levels of key socioeconomic variables – namely, race, age group, and sex. Because we lack individual outcome data (i.e. each census observation within a given PUMA has the same PUMA-level hospitalization, death, and vaccination rate), we assumed for this analysis that all persons in a given PUMA had equal likelihood of a particular outcome (hospitalization, death, or vaccination) being true, with the likelihood corresponding to the PUMA outcome rate.

Some key findings include:

  • Males and females appear similar on each outcome
  • Hospitalization and death rates tend to be higher after middle age compared to those below middle age, whereas vaccination rate shows a general increase with each age group
  • In general, white individuals have a lower likelihood of hospitalization and death, but a higher likelihood – along with Asian and Pacific Islanders – of vaccination, whereas other groups of color and Native Americans seem to have higher hospitalization and death rates, but lower vaccination rates

Similarly, we examined how key outcomes varied across categories of a few seemingly important predictor variables for each outcome observed in our correlation matrix:

Again, there were a few interesting findings here, such as:

  • We observe a monotonic decrease in hospitalization and death rate as household income increases, but a monotonic increase in vaccination rate as household income increases as well
  • Individuals with public health insurance tend to perform similarly on key outcomes to those with no health insurance at all, with both groups worse than those who have private health insurance
  • Individuals with college and graduate education tend to perform better on key outcomes than those without any college education
  • Those with unknown citizenship status, which may signify being “undocumented,” tend to have lower death rates – perhaps indicative of under-reporting due to fear of immigration control / policing, i.e. attention on the person or family
  • As already noted, those on welfare and foodstamps, as well as unemployed, are significantly less likely to be vaccinated against COVID-19

Associations Between Predictors and Outcomes by Borough

To build on the work above, we wanted to observe disparities within boroughs, knowing that different boroughs have different demographic compositions and that the ideal visualizations would allow us to understand how outcomes vary conditioned on borough demographic composition. Each of the following visualizations has three panels: number of people with outcome, categorized by borough and colored by predictor level; % of people with outcome in each borough, colored by predictor level, compared to the overall composition of the borough by predictor level; and percent with outcome variable, in each borough, plotted by level of predictor.

The most interesting finding here is that Manhattan seems to have the most significant racial disparities on hospitalization rate and death rate, along with age disparities on hospitalization rate, and racial and age disparities in vaccination rate. In general, the level of inequality on key outcomes across demographics in Manhattan tends to be higher than in other boroughs.

Regression

outcome_by_year <- 
  read_csv("./modeling/outcome_puma_by_year.csv") %>%
  rename(puma = puma10)

unbiased_data <- read_csv("./modeling/unbiased_group_means.csv") %>%
  merge(outcome_by_year, by = "puma")

best_model <- lm(puma_hosp_rate_2020  ~ language_spanish + 
                   education_bachelors_degree +
                   birthplace_us + health_insurance_public + language_english + 
                   personal_income + employment_not_in_labor_force +
                   health_insurance_public:personal_income +
                   education_bachelors_degree:personal_income +
                   language_english:birthplace_us, data = unbiased_data)

full_model <-  lm(puma_hosp_rate_2020  ~ 
                   (language_spanish + 
                   education_bachelors_degree +
                   birthplace_us + health_insurance_public + language_english + 
                   personal_income + employment_not_in_labor_force)^2, 
                  data = unbiased_data)

Let \(S\) denote the set of PUMAs in NYC. We consider the latent variable model \[ y_s = \xi_s' \beta + \varepsilon_s \] for each \(s \in S\), where \(\xi_s\) a vector of PUMA-level means of data from the census. There are two challenges in considering such a model: (1) we know historically that NYC neighborhoods, particularly those within boroughs, are subject to spatial correlation; and (2) because the census data is recorded at the interview level, we do not observe the group-level means \(\xi_s\).

In order to address (1), we employ heteroscedasticity-robust White standard errors to deal with the potential threat of spatial correlation. We find that all of our coefficient estimates retain their significance under the adjusted standard errors, and hence we leave this check on robustness to the appendix.

More interesting is the challenge presented in (2). For each \(s \in S\), suppose we observe \(i \in \{ 1, \ldots, n_s \}\) individual-level interviews. It is natural to consider the group mean \[ \bar{X}_s \equiv \frac{1}{n_s} \sum_{i = 1}^{n_s} X_{is}. \] However, as shown in Croon and Veldhoven (2007), it is generally the case that using the observed group mean \(\bar{X}_s\) as an estimator of \(\xi_s\) will lead to biased regression coefficients. Thus, we address (2) by following the procedure outlined by Croon and Veldhoven to re-weight our group means and obtain adjusted group means \(\tilde{X}_s\). In particular, let $ _{}$ and \(\hat{\Sigma}_{\nu \nu}\) denote the usual ANOVA estimates of the between- and within-group variation matrices, respectively. The adjusted group means are then given by \[ \tilde{X}_s \equiv \bar{X}' (I - W_s) + \bar{X}_s' W_s, \quad \text{where} \] \[ W_s \equiv \left( \hat{\Sigma}_{\xi \xi} + \hat{\Sigma}_{\nu \nu} / n_s \right)^{-1} \hat{\Sigma}_{\xi \xi} \] and \(I\) denotes the identity matrix.

Regression Dianostics

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

best_model_tidy <- fit(lm_spec, puma_hosp_rate_2020 ~ language_spanish + 
                   education_bachelors_degree +
                   birthplace_us + health_insurance_public + language_english + 
                   personal_income + employment_not_in_labor_force +
                   health_insurance_public:personal_income +
                   education_bachelors_degree:personal_income +
                   language_english:birthplace_us, data = unbiased_data)


check_model(best_model_tidy, 
            check = c("linearity", "qq", "normality", "outliers", "homogeneity"))

Linearity: It is to be observed that the relationship between independent and dependent variables is fairly linear in our model. Thus, we claim that the mean of the hospitalization rate in 2020 (outcome) is a linear combination of the model parameters and the predictor variables.

Equal Variance (homoscedasticity): Based on the residual vs fitted plot, it is to be observed that the points in the plot are randomly dispersed around the horizontal line at 0 without showing a certain trend. This implies no violation of homoscedasticity.

Normality of residuals: According to the normality probability plot, it is to be observed most of the points fall along the line. Therefore, we claim that residuals are normally distributed in our model.

Influential points among outliers: According to residuals vs leverage plot, it is to be observed all points are inside the contour lines. Hence, there are no influential observations. Therefore, we claim that the residuals are independent of each other.

Residuals are uncorrelated: On the basis of the autocorrelation plot, it is to be observed that the lags do not have significant effect because they are within the bounds.

summary(best_model) %>% 
  broom::glance() %>%
  bind_rows(summary(full_model) %>% broom::glance()) %>%
  mutate(model = c("Best Model", "Full Model")) %>%
  relocate(model)
## # A tibble: 2 × 9
##   model  r.squared adj.r.squared sigma statistic p.value    df df.residual  nobs
##   <chr>      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>       <int> <dbl>
## 1 Best …     0.669         0.594  140.      8.89 8.90e-8    10          44    55
## 2 Full …     0.798         0.580  142.      3.66 6.78e-4    28          26    55
Cp <- ols_mallows_cp(best_model, full_model)

names(Cp) <- c("Mallows' Cp")

Cp
## Mallows' Cp 
##    9.525754

Adjusted R-squared: We compared our model to the full model which includes all main effects of our model with all possible second-order interaction terms of main effects. It is obvious that the full model has higher R-squared (0.797) than that of our model (0.669). However, our model has a higher adjusted R-squared (0.593) than that of the full model (0.579).

Mallows’ Cp: Based on Mallows’ Cp criterion, we want our model to have Cp value less than or equal to the number of parameters, indicating little or no bias in the regression model. we obtained our Cp) from our model and the full model. The Cp value is 9.5 and it is less than the number of our model parameters (11). Therefore, we claim that our regression model has little or no bias.

TODO: Output

summary(best_model) %>%
  broom::tidy()
## # A tibble: 11 × 5
##    term                                   estimate std.error statistic   p.value
##    <chr>                                     <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                            220.       6.18e+2     0.356   7.23e-1
##  2 language_spanish                      1446.       2.83e+2     5.12    6.50e-6
##  3 education_bachelors_degree           -1386.       1.20e+3    -1.15    2.56e-1
##  4 birthplace_us                        -2324.       5.51e+2    -4.22    1.20e-4
##  5 health_insurance_public              -3614.       7.60e+2    -4.76    2.15e-5
##  6 language_english                      -365.       5.39e+2    -0.678   5.01e-1
##  7 personal_income                         -0.0266   8.04e-3    -3.31    1.86e-3
##  8 employment_not_in_labor_force         4345.       9.56e+2     4.54    4.29e-5
##  9 health_insurance_public:personal_in…     0.0812   1.92e-2     4.22    1.19e-4
## 10 education_bachelors_degree:personal…     0.0554   1.87e-2     2.97    4.81e-3
## 11 birthplace_us:language_english        2145.       9.08e+2     2.36    2.26e-2

Predictive Risk Scoring & Clustering

Risk Scoring

Following our regression work, we decided to make our insights more actionable by developing a novel scoring method capable of indicating whether a PUMA is at relatively lower or higher risk of achieving some COVID-19-related outcome. To demonstrate the method, we developed a risk score for whether a PUMA is likely to have achieved or not achieved a vaccination rate of 70% among residents, corresponding to the level of population immunity generally considered requisite to “halt the pandemic.” That said, our method is applicable to other outcomes as well, and one might imagine using it to score a given PUMA on the possibility of a hospitalization rate above X%, or a death rate above Y%. We only chose to begin with vaccination rate given our inability to develop a linear regression for this particular outcome, as opposed to hospitalization and death rates.

The risk scoring method we developed is predicated on (regularized, lasso) logistic regression, used most often for classification tasks because predictions can be interpreted as class probabilities. Regularization further prevents over-fitting of the model. After defining our outcome as 1 for below 70% vaccination and 0 for equal to or above 70% vaccination, we converted our set of predictors to matrix form, and then trained a glmnet model on our training data using 5-fold cross-validation repeated 100 times given the large number of predictors compared to our mere 55 PUMA samples, and because we were interested in predicting the risk score of vaccination rate for each PUMA in our data set. Through our lasso regression, we found an optimal lambda tuning parameter of 0.0102, and generated a model prediction accuracy of 0.886 (~87%). Generally, however, obtaining the kappa value averaged over the simulated confusion matrices is a more useful metric for prediction given unbalanced classes (of our 55 PUMAs, 41 achieve vaccination rate >= 70%, and only 14 do not); Our optimal model’s averaged kappa was 0.63, which is considered reasonably decent, but again suggests that the limited number of in-sample data points may result in model over-fitting.

# 1 indicates BELOW 70% vaccination rate
logistic_df = nyc_puma_summary %>% 
  mutate(
    below_herd_vax = as.factor(ifelse(covid_vax_rate >= 70, 0, 1))
  ) %>% 
  select(-puma, -total_people, -covid_hosp_rate, -covid_death_rate, -covid_vax_rate)

# Define predictors matrix
x = model.matrix(below_herd_vax ~ ., logistic_df)[,-1]

# Define outcomes
y = logistic_df$below_herd_vax
library(caret) # Note: new libraries are loaded here due to potential function masking from the `caret` library
library(mlbench)
set.seed(777)
vax_cv <- trainControl(method = "repeatedcv", number = 5, repeats = 100, 
                       savePredictions = T
                       )

# Goal is to find optimal lambda
lasso_model <- train(below_herd_vax ~ ., data = logistic_df,
                     method = "glmnet",
                     trControl = vax_cv,
                     tuneGrid = expand.grid(
                       .alpha = 1,
                       .lambda = seq(0.0001, 1, length = 100)),
                     family = "binomial")
coef <- coef(lasso_model$finalModel, lasso_model$bestTune$lambda)

sub_lasso <-
  subset(lasso_model$pred, lasso_model$pred$lambda == lasso_model$bestTune$lambda)

caret::confusionMatrix(table(sub_lasso$pred, sub_lasso$obs))
## Confusion Matrix and Statistics
## 
##    
##        0    1
##   0  731  258
##   1  369 4142
##                                           
##                Accuracy : 0.886           
##                  95% CI : (0.8773, 0.8943)
##     No Information Rate : 0.8             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6297          
##                                           
##  Mcnemar's Test P-Value : 1.118e-05       
##                                           
##             Sensitivity : 0.6645          
##             Specificity : 0.9414          
##          Pos Pred Value : 0.7391          
##          Neg Pred Value : 0.9182          
##              Prevalence : 0.2000          
##          Detection Rate : 0.1329          
##    Detection Prevalence : 0.1798          
##       Balanced Accuracy : 0.8030          
##                                           
##        'Positive' Class : 0               
## 

Notably, feature selection from our lasso regression discovered that the most important predictors of risk for sub-70% vaccination were insurance composition, employment composition, poverty composition, and welfare composition in a given PUMA – largely aligned with key correlates of vaccination noted in our exploratory analysis.

lambda  <- seq(0.0001, 1, length = 100)

lambda_opt = lasso_model$bestTune$lambda

result_plot <- broom::tidy(lasso_model$finalModel) %>% 
  select(term, lambda, estimate) %>% 
  complete(term, lambda, fill = list(estimate = 0) ) %>% 
  filter(term != "(Intercept)") %>% 
  ggplot(aes(x = log(lambda, 10), y = estimate, group = term, color = term)) + 
  geom_path() + 
  geom_vline(xintercept = log(lambda_opt, 10), color = "blue", size = 1.2) +
  theme(legend.position = "none") +
  labs(y = "Coefficient Estimate", title = "Coefficient Estimates for Varying Values of Lambda")

result_plot

Again, we wanted to move beyond binary classification to develop a risk score. Generally, in logistic regression, when a data point is predicted with probability > 0.5 to be a “1,” the model classifies it as a 1, and otherwise as a 0. We obtained risk scores not only by classifying PUMAs as above or below 70% vaccination rate using our prediction model, but by obtaining the exact probability of a given data point being a 1. For instance, if a PUMA has an 85% chance of being below 70% vaccination rate according to our classifier, its risk score would be 85, even though our model would binarily predict it to be a “1” rather than a “0.”

lambda <- lasso_model$bestTune$lambda
lasso_fit = glmnet(x, y, lambda = lambda, family = "binomial")
risk_predictions = (round((predict(lasso_fit, x, type = "response"))*100, 1))

puma <- nyc_puma_summary %>% 
  select(puma)

vax <- logistic_df %>% 
  select(below_herd_vax)

risk_prediction <- 
  bind_cols(puma, vax, as.vector(risk_predictions)) %>%
  rename(risk_prediciton = ...3)

risk_prediction %>%
  mutate(puma = fct_reorder(puma, risk_prediciton, .desc = TRUE)) %>%
  ggplot(aes(x = puma, y = risk_prediciton, fill = below_herd_vax)) + 
  geom_bar(stat  = "identity") + 
  geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Predicting Risk Score of Vaccination Rate across PUMA",
       x = "PUMA", y = "Predicted Risk Score", fill = "Actual Vax Status") + 
  scale_fill_viridis(discrete = TRUE, labels = c("Below 70%", "Above 70%"))

Clustering

We were also curious how statistical learning techniques would cluster our PUMAs based on predictors, and how those clusters might correspond to performance on key outcomes. We separated our data into a tibble of predictors only and a tibble of outcomes only, then fit three clusters – just to try it – on our predictors alone. We proceeded to plot each PUMA on hospitalization rate vs vaccination rate (choosing to forego death rate, in this case, given its collinearity and likely redundancy of hospitalization rate), then colored each data point by predicted clustering.

# Define tibble of predictors only
predictors = nyc_puma_summary %>% 
  select(-puma, -total_people, -covid_hosp_rate, -covid_vax_rate, -covid_death_rate)

# Define tibble of outcomes only
outcomes = nyc_puma_summary %>% 
  select(covid_hosp_rate, covid_death_rate, covid_vax_rate)

# Define tibble of pumas only
pumas = nyc_puma_summary %>% 
  select(puma)

# Fit 3 clusters on predictors
kmeans_fit = 
  kmeans(x = predictors, centers = 3)

# Add clusters to data frame of predictors and bind with PUMA and outcomes data
predictors = 
  broom::augment(kmeans_fit, predictors)

# Bind columns
full_df = cbind(pumas, outcomes, predictors)

# Summary df
summary_df = full_df %>% 
  group_by(.cluster) %>% 
  summarize(
    median_hosp = median(covid_hosp_rate),
    median_death = median(covid_death_rate),
    median_vax = median(covid_vax_rate)
  )

# Plot predictor clusters against outcomes
# Example: try hospitalization vs vaccination
ggplot(data = full_df, aes(x = covid_hosp_rate, y = covid_vax_rate, color = .cluster)) + 
  geom_point() + 
  geom_point(data = summary_df, aes(x = median_hosp, y = median_vax), color = "black", size = 4) +
  geom_point(data = summary_df, aes(x = median_hosp, y = median_vax, color = .cluster), size = 2.75) +
  labs(x = "COVID Hospitalization Rate", y = "COVID Vaccination Rate", title = "COVID Hospitalization vs Vaccination Rates for each PUMA", color = "Cluster")

Generally, our three clusters included:

  • One cluster with relatively low hospitalization rate (~5%) and relatively high vaccination rate (~75%)
  • Two clusters with relatively identical average hospitalization rate (~10%), but one with substantially higher vaccination rate (nearly 60%) than the other (nearly 50%)

For fun, we also evaluated Euclidean distance between our observed PUMAs, and alternatively mapped PUMAs onto their respective clusters after reducing our predictors to two key (principal) component dimensions.

# Scale predictors
for_clustering = predictors %>% 
  select(-.cluster) %>% 
  na.omit() %>% 
  scale()

# Evaluate Euclidean distances between observations
distance = get_dist(for_clustering)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# Cluster with three centers
k_scaled = kmeans(for_clustering, centers = 3)

# Visualize cluster plot with reduction to two dimensions
fviz_cluster(k_scaled, data = for_clustering)

# Bind with outcomes and color clusters
full_df = for_clustering %>% 
  as_tibble() %>% 
  cbind(outcomes, pumas) %>% 
  mutate(
    cluster = k_scaled$cluster
  )

Because we settled on setting the number of clusters at three relatively arbitrarily, we decided to evaluate the clustering quality using both WSS, silhouette, and gap methods.

# Check where elbow occurs using WSS method
fviz_nbclust(for_clustering, kmeans, method = "wss")

# Check for optimal number of clusters using silhouette method
fviz_nbclust(for_clustering, kmeans, method = "silhouette")

# Check number of clusters that minimize gap statistic
gap_stat = clusGap(for_clustering, FUN = kmeans, nstart = 25, K.max = 20, B = 50)
fviz_gap_stat(gap_stat)

The elbow in our WSS plot indicates that three clusters may be optimal, whereas the silhouette plot shows an average silhouette width optimized at two clusters. Finally, the gap plot shows that clustering is actually optimized when the number of clusters equals 1 – i.e. when no clustering occurs. The relative lack of concordance between these assessments of clustering quality is no surprise given the fact that our model is fit on only 55 predictors. For completeness, we re-modeled our clustering of predictors with only k = 2 clusters, and again tried to visualize where these clusters of PUMAs were distributed on the hospitalization/vaccination rate axes:

# Scale predictors
for_clustering = predictors %>% 
  select(-.cluster) %>% 
  na.omit() %>% 
  scale()

# Evaluate Euclidean distances between observations
distance = get_dist(for_clustering)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

# Cluster with two centers
k_scaled2 = kmeans(for_clustering, centers = 2)

# Visualize cluster plot with reduction to two dimensions
fviz_cluster(k_scaled2, data = for_clustering)

# Bind with outcomes and color clusters
full_df = for_clustering %>% 
  as_tibble() %>% 
  cbind(outcomes, pumas) %>% 
  mutate(
    cluster = k_scaled2$cluster
  )

With two clusters, our low hospitalization/high vaccination cluster remains, but our two mid-level hospitalization rate clusters are condensed into one cluster.

Discussion

Project Obstacles

The primary obstacles at the outset of the project were related to data pre-processing, getting our disparate data sources to play nice with each other. As mentioned elsewhere, our outcome variables were obtained at the ZCTA-level geography while our predictor variables were obtained at the interview-level and coded to PUMA geographies. We were able to use a ZCTA-PUMA crosswalk data set from Baruch College and a bit of math to merge these data sets for further analysis.

Another obstacle was to aggregate the interview-level data to the PUMA-level geography and to tidy the IPUMS variables according to the census data dictionary. For instance, factor variables like education and race were coded numerically, but these data needed to have descriptive factor levels for analysis. This process is described in more detail in the Data Sources and Cleaning section.

These data presented a unique regression challenge as well. (TO BE COMPLETED)

Findings

In our exploratory analysis we looked at COVID-19 hospitalizations, deaths, and vaccinations across boroughs, PUMAs, and demographic combinations.

When looking at outcomes across boroughs, we saw that PUMA geographies in Queens and the Bronx tended to be above the median NYC PUMA for COVID-19 hospitalization and death rates. Interestingly, PUMA geographies in Queens were among the most highly vaccinated areas in NYC as well.

In general, we did not see notable differences of COVID-19 outcomes between males and females. When looking at demographic categories, we found that hospitalization and death rates were lowest among white males and females under 30 and that vaccination rates were highest among Asian and Pacific Islander males and females. The lowest vaccination rates were among older American Indian individuals and younger and middle-aged Black individuals.

In an analysis of correlations between predictors and outcomes across PUMA geographies, we found income to be strongly associated with COVID-19 hospitalizations (negative association), deaths (negative association), and vaccinations (positive association). We found that many indicators of socioeconomic status – including poverty level, welfare, unemployment, and food stamp utilization – seem highly negatively correlated with vaccination, but not with hospitalization and death. These findings may indicate the potential impact from significant structural inequalities that hinder healthcare access among the poor, given that vaccination is a more “active” outcome (requiring deliberate action) here than hospitalization or death, which are both the result of (in all likelihood) “passive” or indeliberate transmission.

Lastly, we saw that among boroughs, Manhattan had the most significant disparities of outcomes across demographic groups like race and age.

(REGRESSION CONCLUSIONS TO BE ADDED)

Next Steps

Based on our analysis, future research could be performed to better understand the associations we found between socioeconomic status and barriers to healthcare access, shown in the correlations between economic predictors and hospitalization, mortality, and vaccination outcomes. These relationships and the potential causal mechanisms are important to understand in the efforts to emerge from the current COVID-19 pandemic and for future pandemics that we may face.

How can we ensure that diverse populations have access and take advantage of healthcare solutions? Future research can help understand “how to turn vaccines into vaccinations” to quote Dr. Michael Osterholm.